On Sharing Private Data with Multiple Non-Colluding Adversaries

Theodoros Rekatsinas Amol Deshpande Ashwin Machanavajjhala 

University of Maryland University of Maryland Duke University 

thodrek@cs.umd.edu amol@cs.umd.edu ashwin@cs.duke.edu



ABSTRACT 

We present SPARSI, a novel theoretical framework for partition- 
ing sensitive data across multiple non-colluding adversaries. Most 
work in privacy-aware data sharing has considered disclosing sum- 
maries where the aggregate information about the data is preserved, 
but sensitive user information is protected. Nonetheless, there are 
applications, including online advertising, cloud computing and 
crowdsourcing markets, where detailed and fine-grained user-data 
must be disclosed. We consider a new data sharing paradigm and 
introduce the problem of privacy-aware data partitioning, where 
a sensitive dataset must be partitioned among k untrusted parties 
(adversaries). The goal is to maximize the utility derived by partitioning
and distributing the dataset, while minimizing the total
amount of sensitive information disclosed. The data should be dis- 
tributed so that an adversary, without colluding with other adver- 
saries, cannot draw additional inferences about the private informa- 
tion, by linking together multiple pieces of information released to 
her. The assumption of no collusion is both reasonable and neces- 
sary in the above application domains that require release of private 
user information. SPARSI enables us to formally define privacy- 
aware data partitioning using the notion of sensitive properties for 
modeling private information and a hypergraph representation for 
describing the interdependencies between data entries and private 
information. We show that solving privacy-aware partitioning is, 
in general, NP-hard, but for specific information disclosure func- 
tions, good approximate solutions can be found using relaxation 
techniques. Finally, we present a local search algorithm applicable 
to generic information disclosure functions. We apply SPARSI to- 
gether with the proposed algorithms on data from a real advertising 
scenario and show that we can partition data with no disclosure to 
any single advertiser. 

1. INTRODUCTION 

The landscape of online services has changed significantly in the 
recent years. More and more sensitive information is released on 
the Web and is processed by online services. The most common 
paradigm to consider is that of people who rely on online social networks
to communicate and share information with each other. This leads
to a diverse collection of voluntarily published user data. Online
services such as Web search, news portals, recommendation and 
e-commerce systems, collect and store this data in their effort to 
provide high-quality personalized experiences to a heterogeneous 
user base. Naturally, this leads to increased concerns related to an 
individual's privacy and the possibility of private personal informa- 
tion being aggregated by untrusted third-parties such as advertisers. 

A different application domain that is increasingly popular is 
crowdsourcing markets. Tasks, typically decomposed into micro- 
tasks, are submitted by users to a crowdsourcing market and are 
fulfilled by a collection of workers. The user needs to provide 
each worker with the necessary data to accomplish each micro-task.
However, this data may contain information that is sensitive
and care must be taken not to disclose any more sensitive informa- 
tion than minimally needed to accomplish the task. Consider, for 
example, the task of labeling a dataset that contains information 
about the location of different individuals, that needs to be used 
as input to a machine learning algorithm. Since the cost of hand- 
labeling the dataset is high, submitting this task to a crowdsourcing 
market provides an inexpensive alternative. However, the dataset 
might contain sensitive information about the trajectories the indi- 
viduals follow as well as the structure of the social network they 
form. Hence, we must carefully partition the dataset
across the different untrusted workers in order to avoid disclosing sensitive
information. Observe that under this paradigm, the sensitive
information contained in the dataset is not necessarily associated 
with a particular data entry. 

Similarly, with the rise of cloud computing, increasing volumes
of private data are being stored and processed on untrusted servers 
in the cloud. Even if the data is stored and processed in an en- 
crypted form, an adversary may be able to infer some of the private 
information by aggregating, over a period of time, the information 
that is available to it (e.g., password hashes of users, workload in- 
formation). This has led security researchers to recommend split- 
ting data and workloads across systems or organizations to remove 
such points of compromise. 

In all applications presented above, a party, called publisher, is 
required to distribute a collection of data (e.g., user information) 
to many different third parties. The utility in sharing data results 
either from the improved quality of personalized services or from 
the cost reduction in fulfilling a decomposable task. The sensitive 
information is often not limited to the identity of a particular entity 
in the dataset (e.g., a user using a social network based service), 
but rather arises from the combination of a set of data items. It is 
these sets we would like to partition across different adversaries.
We next use two real-world examples to illustrate this. 



Example 1. Consider a location-based social network, such
as Gowalla¹ and Brightkite², where users check in at different places
they visit. The available data contains information about the locations
of the users at different time instances and the structure of the
social network connecting the users. User location data is of par- 
ticular interest to advertisers, as analyzing it can provide them with 
a rather detailed profile of the habits of the user. Using such data
allows advertisers to devise highly efficient personalized marketing 
strategies. Hence, they are willing to pay large amounts of money 
to the data publisher for user information. However, analyzing the 
location of multiple users collectively can reveal information about 
the friendship links between users, thus, revealing the structure of 
the social network. Disclosing the structure of the social network
might not be desirable for the online social network provider,
as it can be used for viral marketing purposes, which may drive 
users away from using the social network. It is easy to see that a 
natural tradeoff exists between publishing user data, and receiving 
high monetary utility, versus keeping this data private to ensure the 
popularity of the social network. 

This example shows how an adversary may infer some sensi- 
tive information that is not explicitly mentioned in the dataset but 
is related to the provided data and can be inferred only when par- 
ticular entries of the dataset are collectively analyzed. Not reveal- 
ing all those entries together to an adversary prevents disclosure of 
the sensitive information. We further exemplify this setup using a 
crowdsourcing application. 

Example 2. Consider a data publisher with a collection of 
medical prescriptions to be transcribed. Each prescription con- 
tains sensitive information, such as the disease of the patient, the 
prescribed medication, the identity of the patient, and the identity of 
the doctor. Furthermore, the publisher would like to minimize the 
total cost of the transcription. Thus, she considers using a crowd- 
sourcing solution where she partitions the task into micro-tasks to 
be submitted to multiple workers. It is obvious that if all fields in 
the prescription are revealed to the same worker, highly sensitive 
information is disclosed. However, if the dataset is partitioned in 
such a way that different workers are responsible for transcribing 
different fields of one prescription, no information is disclosed as 
patients cannot be linked with a particular disease or a particular 
doctor. In this case, the utility of the publisher stems from fulfilling
the task at a reduced cost. 

Despite being simplistic, the second example illustrates how dis- 
tributing a dataset can allow one to use it for a particular task, while 
minimizing the disclosure of sensitive information. Motivated by 
applications such as the ones presented above, we introduce the 
problem of privacy-aware partitioning of a dataset, where our goal 
is to partition a dataset among k untrusted parties and to maximize 
either the user's utility, or the third parties' utilities, or a combination
of those. Further, we would like to do this while minimizing the 
total amount of sensitive information disclosed. 

Most of the previous work has either considered sharing
privacy-preserving summaries of data, where the aggregate information
about the population of users is preserved, or has bypassed the
use of personal data and its disclosure to multiple advertisers
[20, 17, 12]. These approaches focus on worst-case scenarios, assuming
arbitrary collusion among adversaries. Therefore, all adversaries
are combined and treated as a single adversary. However,
this strong threat model does not allow publishing of fine-grained 
information. Other approaches have explicitly focused on online 

¹ http://en.wikipedia.org/wiki/Gowalla
² http://en.wikipedia.org/wiki/Brightkite



advertising, and have developed specialized systems that limit the 
disclosure of sensitive user-related information when deployed to 
a user's Web browser [11, 21]. Finally, Krause et al. have studied
how the disclosure of a subset of the attributes of a data entry can
allow access to fine-grained information [14]. While they examine
the utility and disclosure tradeoff, their proposed framework does
not take into account the interdependencies across different data
entries and assumes a single adversary (third party).

In this work we propose SPARSI, a new framework that allows us
to formally reason about leakage of sensitive information in scenar- 
ios such as the ones presented above, namely, setups where we are 
given a dataset to be partitioned among a set of non-colluding ad- 
versaries in order to obtain some utility. We consider a generalized 
form of utility that captures both the utility that each adversary ob- 
tains by receiving part of the data and the user's personal utility de- 
rived by fulfilling a task. We elaborate more on this generalization 
in the next section. This raises a natural tradeoff between maximiz- 
ing the overall utility while minimizing information disclosure. We 
provide a formal definition of the privacy-aware data partitioning 
problem, as an optimization of the aforementioned tradeoff. 

While non-collusion results in a somewhat weaker threat model, 
we argue that it is a reasonable and practical assumption in a va- 
riety of scenarios, including the ones discussed above. In setups 
like online advertising and cloud computing there is no particular 
incentive for adversaries to collude, due to conflicting monetary in- 
terests. In crowdsourcing scenarios the probability that adversaries 
who may collude will be assigned to the same task is minuscule due 
to the large number of anonymous available workers. Attempts to 
collude can often be detected easily, and the possibility of strict
penalization (by the crowdsourcing market) provides additional
disincentive to collude. Finally, we note that an assumption of no
collusion is both necessary and practical in most of these situations;
otherwise there would be no way to accomplish those tasks.

The main contributions of this paper are as follows: 

• We introduce the problem of privacy-aware data partitioning 
across multiple adversaries, and analyze its complexity. To 
our knowledge this is the first work that addresses the problem 
of minimizing information leakage when partitioning a dataset 
across multiple adversaries. 

• We introduce SPARSI, a rigorous framework based on the no- 
tion of sensitive properties that allows us to formally reason 
about how information is leaked and the total amount of information
disclosure. We represent the interdependencies between
data and sensitive properties using a hypergraph and we show 
how the problem of privacy-aware partitioning can be cast as 
an optimization problem that is NP-hard, via a reduction from
hypergraph partitioning.

• We analyze the problem for specific families of information dis- 
closure functions, including step and linear functions, and show 
how good solutions can be derived by using relaxation tech- 
niques. Furthermore, we propose a set of algorithms, based on 
a generic greedy randomized local search algorithm, for obtain- 
ing approximate solutions to this problem under generic fami- 
lies of utility and information disclosure functions. 

• Finally, we demonstrate how, using SPARSI, one can distribute 
user-location data, as in Example 1, to multiple advertisers
while ensuring that almost no sensitive information about po- 
tential user friendship links is revealed. Moreover, we experi- 
mentally verify the performance of the proposed algorithms for 
both synthetic and real-world datasets. We compare the performance
of the proposed greedy local search algorithm against
approaches tailored to specific disclosure functions, and show 
that it is capable of producing solutions that are close to the 
optimal. 

2. SPARSI FRAMEWORK 

In this section we start by describing the different components 
of SPARSI. More precisely, we show how one can formally reason 
about the sensitive information contained in a dataset by introduc- 
ing the notion of sensitive properties. Then, we show how to model 
the interdependencies between data entries and sensitive properties 
rigorously, and how to reason about the leakage of sensitive infor- 
mation in a principled manner. 

2.1 Data Entries and Sensitive Information 

Let D denote the dataset to be partitioned among different adver- 
saries. Moreover, let A denote the set of adversaries. We assume 
that D is composed of data entries $d_i \in D$ that disclose minimal
sensitive information if revealed alone. To clarify this, consider Example 1,
where each data entry is the check-in location of a user.
The user is sharing this information voluntarily with the social net- 
work service in exchange for local advertisement services, hence, 
this entry is assumed not to disclose sensitive information. In Example 2,
the data entries to be published are the fields of the prescriptions.
Observe that if the disease field is revealed in isolation,
no information is leaked about possible individuals carrying it. 

However, revealing several data entries together discloses sensi- 
tive information. We define a sensitive property to be a property 
that is related to a subset of data entries but not explicitly repre- 
sented in the data set, and that can be inferred if the data entries 
are collectively analyzed. Let P denote the set of sensitive proper- 
ties that are related to data entries in D. To formalize this abstract 
notion of indirect information disclosure, we assume that each sensitive
property $p \in P$ is associated with a variable (either numerical
or categorical) $V_p$ with true value $v_p^*$. Let $D_p \subseteq D$ be the smallest
set of data entries from which an adversary can infer the true value
$v_p^*$ of $V_p$ with high confidence, if all entries in $D_p$ are revealed to
her. We assume that there is a unique such $D_p$ corresponding to
each property $p$. We say that data entries in $D_p$ disclose information
about property $p \in P$ and that information disclosure can be
modeled as a function over $D_p$ (see Section 2.2).

We assume that sensitive properties are specified by an expert 
and the dependencies between data entries in D and properties in 
P, via sets $D_p$, $\forall p \in P$, are represented as an undirected bipartite
graph, called a dependency graph. Returning to the example
applications presented above, we have the following: In Example 1,
the sensitive properties correspond to the friendship links between
users, and the associated datasets $D_p$ correspond to the check-in
information of the pairs of users participating in friendship links.
In Example 2, the sensitive properties correspond to the links between
a patient's id and a particular disease, or a doctor's id and
a particular medication. In general, it has been shown that data mining
techniques can be used to determine the dependencies between
data items and sensitive properties [16].

Let $\mathcal{G}_d$ denote such a dependency graph. $\mathcal{G}_d$ has two types of
nodes, i.e., nodes P that correspond to sensitive properties and
nodes D that correspond to data entries. An edge connects a data
entry $d \in D$ with a property $p \in P$ only if d can potentially disclose
information about p. Alternatively, we can use an equivalent
hypergraph representation, which is easier to reason about in some
cases. Converting the dependency graph $\mathcal{G}_d$ into an equivalent
dependency hypergraph is simply done by mapping each property
node into a hyperedge. We assume that the dimension (i.e., size
of largest hyperedge) of the dependency hypergraph is bigger than
the number of adversaries.

Figure 1: An example of a dependency graph between data entries
and sensitive properties. Data entries corresponding to the
same user are colored using the same color.

An example of a bipartite graph and its equivalent hypergraph is 
shown in Figure 1. Recall that in this example we do not want to
disclose any information about the structure of the social network, 
i.e., the sensitive properties are the friendship links between indi- 
viduals. However, if an adversary is given the check-in locations 
of two individuals, she can infer whether there is a friendship link 
or not between them [2]. The dependencies between check-ins and
friendship links are captured by the edges in the bipartite graph. 
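
The conversion from a dependency graph to its hypergraph form can be sketched in a few lines of Python. This is only an illustration: the function name `to_hypergraph` and the entry and property labels are hypothetical and do not come from the dataset used in the paper.

```python
from collections import defaultdict

def to_hypergraph(edges):
    """Map each property node of a bipartite dependency graph to a hyperedge.

    `edges` is an iterable of (data_entry, property) pairs. The result maps
    each property p to the set D_p of data entries connected to it.
    """
    hyperedges = defaultdict(set)
    for entry, prop in edges:
        hyperedges[prop].add(entry)
    return dict(hyperedges)

# Hypothetical check-ins of three users and two friendship-link properties.
edges = [
    ("alice_checkin_1", "friends(alice,bob)"),
    ("bob_checkin_1", "friends(alice,bob)"),
    ("bob_checkin_1", "friends(bob,carol)"),
    ("carol_checkin_1", "friends(bob,carol)"),
]
hypergraph = to_hypergraph(edges)
```

Each key of `hypergraph` plays the role of a sensitive property, and its value is the hyperedge of data entries that together can reveal it.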

2.2 Information Disclosure 

We model the information disclosed to an adversary a using a
vector-valued function $f_a : \mathcal{P}(D) \to [0, 1]^{|P|}$, which takes as input
the subset of data entries published to an adversary, and returns
a vector of disclosure values, one per sensitive property. That is,
$f_a(S_a)[i]$ denotes the information disclosed to adversary $a \in A$
about the $i$th property when a has access to the subset $S_a$ of data entries.
We assume that information disclosure takes values in [0, 1],
with 0 indicating no disclosure and 1 indicating full disclosure.
Generic disclosure functions, including posterior beliefs, and dis- 
tribution distances can be naturally represented by SPARSI. The 
only requirement is that the disclosure function returns a value for 
each sensitive property. 

Based on the disclosure functions of all adversaries we define the
overall disclosure function $f$ as an aggregate of all functions in $F = \{f_a : a \in A\}$.
Before presenting the formal definition, we define the assignment
set, given as input to $f$.

Definition 1 (Assignment Set). Let $x_{da}$ be an indicator
variable set to 1 if data entry $d \in D$ is published to adversary
$a \in A$. We define the assignment set $S$ to be the set of all variables
$x_{da}$, i.e., $S = \{x_{11}, \dots, x_{1|A|}, \dots, x_{|D||A|}\}$, and the adversary's
assignment set $S_a$ to be the set of indicator variables corresponding
to adversary $a \in A$, i.e., $S_a = \{x_{1a}, x_{2a}, \dots, x_{|D|a}\}$.



Worst Disclosure. The overall disclosure can be expressed as:

$$f_\infty(S) = \max_{a \in A} \lVert f_a(S_a) \rVert_\infty \qquad (1)$$

Observe that using the infinity norm accounts for the worst case 
disclosure across properties. Thus, full disclosure for at least one 
sensitive property suffices to maximize the information leakage. 
This function is indifferent to the total number of sensitive prop- 
erties that are fully disclosed in a particular partitioning and gives 
the same score to all that have at least one fully disclosed property. 

However, there are cases where one is not interested in the worst 
case disclosure but only interested in the total information disclosed 
to any adversary. Following this observation we introduce another 
variation of the overall disclosure function that considers the total 
information disclosure per adversary. 

Average Disclosure. We replace the infinity norm in the equation
above with the $L_1$ norm:

$$f_{L_1}(S) = \max_{a \in A} \frac{\lVert f_a(S_a) \rVert_1}{|P|} \qquad (2)$$

Observe that both Equation 1 and Equation 2 consider the maximum
over the disclosure across adversaries, i.e., they can be written
as:

$$f(S) = \max_{a \in A} f'_a(S_a) \qquad (3)$$

where $f'_a(S_a) = \lVert f_a(S_a) \rVert_\infty$ or $f'_a(S_a) = \lVert f_a(S_a) \rVert_1 / |P|$.
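
As a small numerical illustration of the two aggregates, the following Python sketch computes both from per-adversary disclosure vectors. It assumes that the average variant divides the $L_1$ norm by the number of properties, and the disclosure values are made up for the example.

```python
def worst_disclosure(per_adversary):
    """Max over adversaries of the infinity norm of the disclosure vector."""
    return max(max(vec) for vec in per_adversary.values())

def average_disclosure(per_adversary):
    """Max over adversaries of the L1 norm divided by the number of properties.

    Disclosure values lie in [0, 1], so the L1 norm is a plain sum.
    """
    return max(sum(vec) / len(vec) for vec in per_adversary.values())

# Hypothetical disclosure vectors f_a(S_a): two adversaries, three properties.
disclosure = {
    "a1": [1.0, 0.0, 0.0],  # one property fully disclosed
    "a2": [0.5, 0.5, 0.5],  # every property partially disclosed
}
worst = worst_disclosure(disclosure)
average = average_disclosure(disclosure)
```

Here adversary `a1` determines the worst-case score while `a2` determines the average score, which shows that the two aggregates can rank the same partitioning differently.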

2.3 Overall Utility 

Let u denote the utility derived by partitioning the dataset across 
multiple adversaries. We have that $u : \mathcal{P}(D \times A) \to \mathbb{R}^{+}$, where
$\mathcal{P}(D \times A)$ denotes the powerset of possible data-to-adversary assignments.
As we show below, this function either quantifies the
utility of the adversaries when acquiring part of the dataset D (see
Example 1) or the publisher's utility derived by fulfilling a particular
task that requires partitioning the data (see Example 2). Under
many real-world examples these two different kinds of utility can
be unified under a single utility. Consider Example 1. Typically,
advertisers pay higher amounts for data that provide higher utility. 
Thus, maximizing the utility of each individual advertiser maxi- 
mizes the utility (maybe monetary) of the data publisher as well. 

Based on this observation we unify the two types of utilities un- 
der a single formulation based on the utility of adversaries. First 
we focus on the utility of adversaries. Intuitively, we would expect 
that the more data an adversary receives, the less the observation of 
a new, previously unobserved, data entry would increase the gain 
of the adversaries. This notion of diminishing returns is formalized 
by the combinatorial notion of submodularity and is shown to hold 
in many real-world scenarios [18, 14]. More formally, a set function
$G : 2^V \to \mathbb{R}$ mapping subsets $A \subseteq V$ into the real numbers
is called submodular [4] if, for all $A \subseteq B \subseteq V$ and $v' \in V \setminus B$,
it holds that $G(A \cup \{v'\}) - G(A) \geq G(B \cup \{v'\}) - G(B)$, i.e.,
adding $v'$ to a set $A$ increases $G$ at least as much as adding $v'$ to a superset
$B$ of $A$. $G$ is called nondecreasing if, for all $A \subseteq B \subseteq V$, it holds
that $G(A) \leq G(B)$.
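
The diminishing-returns condition can be checked by brute force on a small ground set. The sketch below is illustrative only: `is_submodular` enumerates all pairs of nested subsets, so it is exponential, and the coverage function is a hypothetical example of a utility rather than one used in the paper.

```python
from itertools import combinations

def is_submodular(ground, g):
    """Check G(A + v) - G(A) >= G(B + v) - G(B) for all A within B, v outside B."""
    items = sorted(ground)
    subsets = [frozenset(c) for r in range(len(items) + 1)
               for c in combinations(items, r)]
    for b in subsets:
        for a in subsets:
            if not a <= b:
                continue
            for v in ground - b:
                if g(a | {v}) - g(a) < g(b | {v}) - g(b):
                    return False
    return True

# Hypothetical coverage utility: each data entry covers some topics.
covers = {"d1": {"x", "y"}, "d2": {"y", "z"}, "d3": {"z"}}

def coverage(subset):
    """Number of distinct topics covered by the entries in `subset`."""
    return len(set().union(*(covers[d] for d in subset)))

ground = frozenset(covers)
```

Coverage functions pass the check, whereas a function with increasing marginal gains, such as the squared set size, does not.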

Let $u_a$ be a set function that quantifies the utility of each adversary
$a$. As mentioned above, we assume that $u_a$ is a nondecreasing
submodular function. For convenience we will occasionally drop
the nondecreasing qualification in the remainder of the paper. Let
$U_A$ denote the set of all utility functions for a given set of adversaries
$A$. The overall utility function $u$ can be defined as an aggregate
function of all utilities $u_a \in U_A$. We require that $u$ is also
a submodular function. For example, the overall utility may be defined
as a linear combination, i.e., a weighted sum, of all functions
in $U_A$, following the form:

$$u(S) = \sum_{a \in A} w_a u_a(S_a) \qquad (4)$$

where $S$ and $S_a$ are defined in Definition 1. Since all functions in
$U_A$ are submodular, $u$ will also be submodular, since it is expressed
as a linear combination, with nonnegative weights $w_a$, of submodular functions [7].

An example of a submodular function $u_a$ is an additive function.
Assume that each data entry $d \in D$ has some utility $w_{da}$ for an
adversary $a \in A$. We have that $u_a(S_a) = \sum_{d \in D} w_{da} x_{da}$, where
$x_{da}$ is an indicator variable that takes value 1 when data entry d is
revealed to adversary a and 0 otherwise. For the remainder of the
paper we will assume that the utility $u$ is normalized so that $u \in [0, 1]$.
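
A minimal Python sketch of an additive utility and its weighted-sum aggregate follows. The weights and the assignment are hypothetical, chosen so that the overall utility already lies in [0, 1].

```python
def additive_utility(entry_weights, indicators):
    """Sum of w_da * x_da over the data entries, for a single adversary."""
    return sum(entry_weights[d] * x for d, x in indicators.items())

def overall_utility(entry_weights, assignment, adversary_weights):
    """Weighted sum of the per-adversary additive utilities."""
    return sum(
        adversary_weights[a] * additive_utility(entry_weights[a], assignment[a])
        for a in assignment
    )

# Hypothetical per-entry utilities w_da and indicator variables x_da.
entry_weights = {
    "a1": {"d1": 0.5, "d2": 0.25},
    "a2": {"d1": 0.25, "d2": 0.5},
}
assignment = {
    "a1": {"d1": 1, "d2": 0},
    "a2": {"d1": 0, "d2": 1},
}
adversary_weights = {"a1": 0.5, "a2": 0.5}
total = overall_utility(entry_weights, assignment, adversary_weights)
```

In this toy instance each adversary receives the entry it values most, giving an overall utility of 0.5.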

3. PRIVACY-AWARE DATA PARTITIONING 

In this section, we describe two formulations of the privacy- 
aware partitioning problem. We show how both can be expressed 
as maximization problems that are, in general, NP-hard to solve. 
We consider a dataset D that needs to be partitioned across a given 
set of adversaries A. We assume that each data entry must be re- 
vealed to at least one and at most t adversaries. The lower bound 
arises naturally in both applications discussed in the introduction.
The upper bound is necessary to model cases where the
number of assignments per data entry needs to be restricted as it 
might incur some cost, e.g., monetary in crowdsourcing applica- 
tions. 

We assume that the functions to compute the overall utility and 
information disclosure are given. Let these functions be denoted 
by $u$ and $f$, respectively. Ideally, we wish to maximize the utility
while minimizing the information disclosure; however, there is a natural tradeoff
between the two optimization goals. A traditional approach is to set
a requirement on information disclosure while optimizing the gain. 
Accordingly we can define the following optimization problem. 

Definition 2 (DiscBudget). Let D be a set of data entries,
A be a set of adversaries, and $\tau_I$ be a budget on information
disclosure. This formulation of the privacy-aware partitioning
problem finds a data-entry-to-adversary assignment set S that maximizes
$u(S)$ under the constraint $f(S) \leq \tau_I$. More formally we have
the following optimization problem:

$$
\begin{aligned}
\underset{S \in \mathcal{P}(D \times A)}{\text{maximize}} \quad & u(S) \\
\text{subject to} \quad & f(S) \leq \tau_I, \\
& 1 \leq \sum_{a=1}^{|A|} x_{da} \leq t, \quad \forall d \in D, \\
& x_{da} \in \{0, 1\}.
\end{aligned}
$$

where $x_{da}$ and S are defined in Definition 1 as before and $t \geq 1$
is the maximum number of adversaries to whom a particular data
entry can be published.

This optimization problem already captures our desire to reduce 
the information disclosure while increasing the utility. However, 
depending on the value of $\tau_I$, the optimization problem presented
above might be infeasible. Infeasibility occurs when $\tau_I$ is so small
that no assignment of data to adversaries exists such that $\sum_{a=1}^{|A|} x_{da} \geq 1$,
$\forall d \in D$, and $f(S) \leq \tau_I$.
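
To make the infeasibility issue concrete, the sketch below solves a tiny DiscBudget instance by exhaustive search. It is not one of the algorithms proposed in the paper: the solver is exponential in the number of entries, and the function names, the step-style disclosure, and the counting utility are assumptions made for illustration.

```python
from itertools import combinations, product

def solve_disc_budget(entries, adversaries, utility, disclosure, budget, t=1):
    """Exhaustive search; returns (utility, assignment) or None if infeasible."""
    # Each entry goes to a nonempty set of at most t adversaries.
    choices = [frozenset(c) for r in range(1, t + 1)
               for c in combinations(adversaries, r)]
    best = None
    for combo in product(choices, repeat=len(entries)):
        assignment = {
            a: frozenset(d for d, recv in zip(entries, combo) if a in recv)
            for a in adversaries
        }
        if disclosure(assignment) > budget:
            continue  # violates the disclosure budget
        value = utility(assignment)
        if best is None or value > best[0]:
            best = (value, assignment)
    return best

# Hypothetical instance: one sensitive property that needs both entries.
entries = ["d1", "d2"]
sensitive = frozenset(entries)

def disclosure(assignment):
    """1 if some adversary holds the whole sensitive set, else 0."""
    return int(any(sensitive <= held for held in assignment.values()))

def utility(assignment):
    """Number of (entry, adversary) pairs that are published."""
    return sum(len(held) for held in assignment.values())

infeasible = solve_disc_budget(entries, ["a1"], utility, disclosure, budget=0)
feasible = solve_disc_budget(entries, ["a1", "a2"], utility, disclosure, budget=0)
```

With a single adversary and a zero budget no assignment satisfies both constraints, so the search returns `None`; adding a second adversary makes the instance feasible.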

To overcome this, we consider a different formulation of the 
privacy-aware data partitioning problem where we seek to maxi- 
mize the difference between the utility and the information disclo- 
sure functions. We consider the Lagrangian relaxation of the pre- 
vious optimization problem. Again, we assume that both functions 
are measured using the same unit. We have the following: 



Definition 3 (Tradeoff). Let D be a set of data entries,
A be a set of adversaries, and $\tau_I$ be a budget on information disclosure.
This formulation of the privacy-aware partitioning problem
finds a data-entry-to-adversary assignment S that maximizes the
tradeoff between the overall utility and the overall information disclosure,
i.e., $u(S) + \lambda(\tau_I - f(S))$, where $\lambda$ is a nonnegative weight.
More formally we have the following optimization problem:

$$
\begin{aligned}
\underset{S \in \mathcal{P}(D \times A)}{\text{maximize}} \quad & u(S) + \lambda(\tau_I - f(S)) \\
\text{subject to} \quad & 1 \leq \sum_{a=1}^{|A|} x_{da} \leq t, \quad \forall d \in D, \\
& x_{da} \in \{0, 1\}.
\end{aligned}
$$

where $x_{da}$ and S are defined in Definition 1 and t is the maximum number
of adversaries to whom a data entry can be published.
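
The effect of the weight in the Tradeoff objective can be seen on two candidate assignments. The Python sketch below uses made-up utility and disclosure values and a budget of 0; the candidate names are hypothetical.

```python
def tradeoff(u_value, f_value, tau, lam):
    """Tradeoff objective: utility plus lam times (budget minus disclosure)."""
    return u_value + lam * (tau - f_value)

# Hypothetical (utility, disclosure) pairs for two candidate assignments.
candidates = {
    "share_everything": (1.0, 1.0),  # high utility, full disclosure
    "split_entries": (0.5, 0.0),     # lower utility, no disclosure
}

def best_candidate(lam, tau=0.0):
    """Name of the candidate that maximizes the Tradeoff objective."""
    return max(candidates, key=lambda name: tradeoff(*candidates[name], tau, lam))
```

Unlike DiscBudget, every assignment receives a score, so the problem is never infeasible; a small weight favors the high-utility assignment, while a larger weight favors the private one.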

We prove that both versions of the privacy-aware data partitioning
problem above are NP-hard via a reduction from the
hypergraph coloring problem [3].

Theorem 1. Both formulations of the privacy-aware data partitioning
problem are, in general, NP-hard to solve.

Proof. Fix a set of adversaries denoted by A and a set of data
entries denoted by D. Let P denote the sensitive properties that
correspond to data entries in D. We are required to partition D across
the adversaries in A. Consider the following instance of the privacy-aware
data partitioning problem. We require that each data entry be
published to exactly one adversary. Moreover, we set the maximum
budget on information disclosure to be 1. We also fix the overall information
disclosure to be a step function of the following form: if
all the data entries corresponding to a particular sensitive property
are revealed to the same adversary, the overall disclosure is 1; otherwise
it is 0. Finally, we consider a constant utility function, which
is always equal to 1. Considering the hypergraph representation of
data and properties, it is easy to see that this problem is equivalent
to the hypergraph coloring problem, which is NP-hard [3]. Reversing
the above steps, one can easily reduce any instance of hypergraph
coloring to the privacy-aware data partitioning problem. □

In the remainder of the paper we describe efficient heuristics for 
solving the partitioning problem: we present approximation algorithms
for specific information disclosure functions in Section 4
and a greedy local-search heuristic for the general problem in Section 5.
Due to space constraints, henceforth, we will only focus on
the Tradeoff formulation. Many of our algorithms also work for
the DiscBudget formulation (with slight modifications).

4. ANALYZING SPECIFIC CASES OF INFORMATION DISCLOSURE

In this section, we present approximation algorithms when the 
information disclosure function takes the following special forms: 
1) step functions, 2) linearly increasing functions. The utility func- 
tion is assumed to be submodular. 

4.1 Step Functions 

Information disclosure functions that correspond to a step function
can model cases when each sensitive property $p \in P$ is either
fully disclosed or fully private. A natural application of step functions
is the crowdsourcing scenario shown in Example 2. When
certain fields of a medical prescription, e.g., name together with
diagnosis, or gender together with the zip code and birth date, are
revealed to an adversary, the corresponding sensitive property for
the patient is revealed.



We continue by describing such functions formally. Let $D_p \subseteq D$
be the set of data entries associated with p. Property p is fully disclosed
only if $D_p$ is published in its entirety to an adversary. This
can be modeled using a set of step functions $f_a \in F$: $f_a(D_a)[p] = 1$
if the set $D_a$ of data items assigned to adversary a contains all the elements
of $D_p$ associated with property p. If $D_p \not\subseteq D_a$, then $f_a(D_a)[p] = 0$.
Observe that information disclosure is minimized (and is equal
to 0) when no adversary gets all the elements in $D_p$, for all p. For
step functions we consider worst case disclosure, since ideally we 
do not want to fully disclose any property. 
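
A step disclosure function reduces to a subset test per property. The following Python sketch applies it to a prescription-like example; the field names, property names, and worker identifiers are hypothetical.

```python
def step_disclosure(held, hyperedges):
    """For each property p, return 1 if all of D_p is in `held`, else 0."""
    return {p: int(d_p <= held) for p, d_p in hyperedges.items()}

def worst_case_disclosure(assignment, hyperedges):
    """Maximum step disclosure over all adversaries and properties."""
    return max(
        value
        for held in assignment.values()
        for value in step_disclosure(held, hyperedges).values()
    )

# Hypothetical sensitive properties and the fields that reveal them.
hyperedges = {
    "patient-disease": {"patient_name", "diagnosis"},
    "doctor-medication": {"doctor_name", "medication"},
}
# Two ways of splitting the four fields between two workers.
split = {"w1": {"patient_name", "medication"}, "w2": {"diagnosis", "doctor_name"}}
leaky = {"w1": {"patient_name", "diagnosis"}, "w2": {"doctor_name", "medication"}}
```

The `split` assignment keeps every hyperedge divided across workers and has worst-case disclosure 0, whereas `leaky` hands a complete hyperedge to each worker and has worst-case disclosure 1.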

Considering DiscBudget and Tradeoff separately is not meaningful
for step functions. Since disclosure can only take the extreme
values {0, 1}, $\tau_I$ in Tradeoff should be set to 0, so that full
disclosure of a property always penalizes the utility. Hence, one
can reformulate the problem and seek solutions that maximize
the utility function under the constraint that information disclosure
is 0, i.e., no property exists such that all its corresponding data entries
are published to the same adversary.

Given these families of information disclosure functions and a 
submodular utility function, both formulations of privacy-aware 
data partitioning can be represented as an integer program (IP): 

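$$
\begin{aligned}
\underset{S \in \mathcal{P}(D \times A)}{\text{maximize}} \quad & u(S) \\
\text{subject to} \quad & \sum_{d \in D_p} x_{da} < |D_p|, \quad \forall p \in P, \ \forall a \in A, \\
& 1 \leq \sum_{a=1}^{|A|} x_{da} \leq t, \quad \forall d \in D, \\
& x_{da} \in \{0, 1\}.
\end{aligned}
\qquad (7)
$$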

where t is the maximum number of adversaries to whom a particu- 
lar data entry can be published. 

The first constraint enforces that there is no full disclosure of a 
sensitive property. The partitioning constraint enforces that a data 
entry is revealed to at least one but no more than t adversaries. 
Solving the optimization problem in (7) corresponds to maximizing
a submodular function under linear constraints. Recall that the
utility function is submodular and observe that all constraints in the
optimization problem presented above are linear. In fact, they can be viewed
as packing constraints.

For additive utility functions ($u = \sum_{a \in A} \sum_{d \in D} w_{da} x_{da}$),
Equation 7 becomes an integer linear program that can be approximately
solved in PTIME in two steps. First, one can solve a linear
relaxation of Equation 7, where $x_{da}$ is some fraction in [0, 1].
Second, the resulting fractional solution can be converted into an integral
solution using a rounding strategy.

The simplest rounding strategy, called randomized rounding [19], works as follows: assign data entry $d$ to an adversary $a$ with probability equal to $x_{da}$, where $x_{da}$ is the fractional solution to the linear relaxation. The value of the objective function achieved by the resulting integral solution is in expectation equal to the optimal value of the objective achieved by the linear relaxation. Moreover, randomized rounding preserves all constraints in expectation. A different kind of rounding, called dependent rounding [8], ensures that constraints are satisfied in the integral solution with probability 1. For an overview of different randomized rounding techniques and the quality of the derived solutions for budgeted problems we refer the reader to the work by Doerr et al. [4].
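The naive randomized rounding step can be sketched in a few lines of Python. The fractional solution `x_frac` below is a made-up stand-in for the output of an LP solver (no solver is called here), and the function names are hypothetical. The sketch also estimates empirically that the rounded objective matches the LP objective in expectation for an additive utility.

```python
import random


def randomized_rounding(x_frac, rng):
    """Round each fractional x[d][a] in [0, 1] to 1 with probability x[d][a]."""
    return {
        d: {a: 1 if rng.random() < p else 0 for a, p in row.items()}
        for d, row in x_frac.items()
    }


def estimate_expected_objective(x_frac, weights, trials, seed=0):
    """Average the additive utility sum(w_da * x_da) over many roundings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x_int = randomized_rounding(x_frac, rng)
        total += sum(weights[d][a] * x_int[d][a] for d in x_int for a in x_int[d])
    return total / trials


# Hypothetical fractional solution and utility weights.
x_frac = {"d1": {"a1": 0.5, "a2": 0.5}, "d2": {"a1": 1.0, "a2": 0.0}}
weights = {"d1": {"a1": 0.9, "a2": 0.2}, "d2": {"a1": 0.4, "a2": 0.8}}

lp_value = sum(weights[d][a] * x_frac[d][a] for d in x_frac for a in x_frac[d])
print(lp_value, estimate_expected_objective(x_frac, weights, trials=20000))
```

Note that a single rounded solution may violate the cardinality or disclosure constraints, since they only hold in expectation; this is the gap that dependent rounding addresses.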

One can solve the general problem with worst-case approxima- 
tion guarantees by leveraging a recent result on submodular maxi- 
mization under multiple linear constraints by Kulik et al. 



Theorem 2. Let the overall utility function $u$ be a nondecreasing submodular function. One can find a feasible solution to the optimization problem in Equation 7 with expected approximation ratio of $(1 - \epsilon)(1 - e^{-1})$, for any $\epsilon > 0$, in polynomial time.

Proof. This holds directly by Theorem 2.1 of Kulik et al. [15]. □

To achieve this approximation ratio, Kulik et al. introduce a 
framework that first obtains an approximate solution for a contin- 
uous relaxation of the problem, and then uses a non-trivial combi- 
nation of a randomized rounding procedure with two enumeration 
phases, one on the most profitable elements, and the other on the 
'big' elements, i.e., elements with high cost. This combination en- 
ables one to show that the rounded solution can be converted to 
a feasible one with high expected profit. Due to the intricacy of 
the algorithm we refer the reader to Kulik et al. [15] for a detailed
description of the algorithm. 

4.2 Linearly Increasing Functions 

In this section, we consider linearly increasing disclosure functions. Linear disclosure functions can naturally model situations where each data entry independently affects the likelihood of disclosure. In particular, if normalized log-likelihood is used as a measure of information disclosure, the disclosure function takes the linear form presented below. We consider the following additive form for the disclosure of property $p$:



$$
f_a(S_a)[p] = \sum_{d \in D_p} a_{dp} x_{da} \tag{8}
$$

where $a_{dp}$ is a weight associated with the information that is disclosed about property $p$ when data entry $d$ is revealed to an adversary. We can rewrite the TRADEOFF problem statement as:



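$$
\begin{aligned}
\underset{S \in \mathcal{P}(D \times A)}{\text{maximize}} \quad & u(S) + \lambda\left(\tau_I - \max_{a \in A} f_a(S_a)\right) \\
\text{subject to} \quad & 1 \le \sum_{a \in A} x_{da} \le t, \quad \forall d \in D, \\
& x_{da} \in \{0, 1\}.
\end{aligned}
\tag{9}
$$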



When the utility function is additive, the above problem is an in- 
teger linear program, and hence can be solved by rounding the LP 
relaxation as explained in the previous section. 
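To make the TRADEOFF objective with linear disclosure concrete, here is a small Python sketch that evaluates it for a given assignment. It assumes an additive utility, worst-case disclosure, and per-property weights $a_{dp}$ that sum to 1 (so that revealing all of $D_p$ yields disclosure 1); the function names and the toy numbers are illustrative assumptions.

```python
def linear_disclosure(items_a, weights_p):
    """Additive disclosure of one property to one adversary.

    items_a: set of data items published to the adversary
    weights_p: dict data item -> a_dp for the items in D_p (assumed to sum to 1)
    """
    return sum(w for d, w in weights_p.items() if d in items_a)


def tradeoff(assignment, utility, prop_weights, lam, tau):
    """u(S) + lam * (tau - worst-case linear disclosure)."""
    worst = max(
        linear_disclosure(items, w_p)
        for items in assignment.values()
        for w_p in prop_weights.values()
    )
    u = sum(utility[d][a] for a, items in assignment.items() for d in items)
    return u + lam * (tau - worst)


# Hypothetical instance: one property spread evenly over two data items.
prop_weights = {"p": {"d1": 0.5, "d2": 0.5}}
utility = {"d1": {"a1": 0.3, "a2": 0.1}, "d2": {"a1": 0.3, "a2": 0.2}}
together = {"a1": {"d1", "d2"}, "a2": set()}
split = {"a1": {"d1"}, "a2": {"d2"}}

print(tradeoff(together, utility, prop_weights, lam=1.0, tau=0.0))
print(tradeoff(split, utility, prop_weights, lam=1.0, tau=0.0))
```

In this toy case, splitting the two items across adversaries gives up a little utility but halves the worst-case disclosure, so it scores higher under the tradeoff.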

However, for general submodular $u(\cdot)$, the objective is no longer submodular: the max of the additive information disclosure functions is not necessarily supermodular [7]. Hence, unlike the case of step functions, we cannot use the result of Kulik et al. [15] to get an efficient approximate solution.

Nevertheless, we can compute approximate solutions in PTIME 
by considering the following max-min formulation of the problem: 



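$$
\begin{aligned}
\underset{S \in \mathcal{P}(D \times A)}{\text{maximize}} \quad & \min_{a \in A}\left(u(S) + \lambda\left(\tau_I - f_a(S_a)\right)\right) \\
\text{subject to} \quad & 1 \le \sum_{a \in A} x_{da} \le t, \quad \forall d \in D, \\
& x_{da} \in \{0, 1\}.
\end{aligned}
\tag{10}
$$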

Since the overall utility function is a nondecreasing submodular 
function, and the disclosure for each adversary is additive, the ob- 
jective now is a max-min of submodular functions. More precisely, 
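for worst-case disclosure (Equation 1), the optimization objective can be rewritten as:

$$
\underset{S \in \mathcal{P}(D \times A)}{\text{maximize}} \quad \min_{a \in A,\, p \in P}\left(u(S) + \lambda\left(\tau_I - f_a(S_a)[p]\right)\right) \tag{11}
$$

and, for average disclosure (Equation 2), it can be written as:

$$
\underset{S \in \mathcal{P}(D \times A)}{\text{maximize}} \quad \min_{a \in A}\left(u(S) + \lambda\left(\tau_I - \frac{1}{|P|}\sum_{p \in P} f_a(S_a)[p]\right)\right) \tag{12}
$$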



The above max-min problem formulation is closely related to the max-min fair allocation problem [10] for both types of information disclosure functions. The main difference between our problem and the max-min fair allocation problem is that data items may be assigned to multiple adversaries, whereas in the max-min fair allocation problem a data item is assigned exactly once. If $t = 1$ then the two problems are equivalent, and thus one can provide worst-case guarantees on the quality of the approximate solution. The problem of max-min fair allocation was studied by Golovin [10], Khot and Ponnuswami [13], and Goemans et al. [9]. Let $n$ denote the total number of data entries (goods in the max-min fair allocation problem) and $m$ denote the number of adversaries (buyers in the max-min fair allocation problem). The first two papers focus on additive functions and give algorithms achieving an $(n - m + 1)$-approximation and a $(2m + 1)$-approximation respectively, while the third one gives an $O(n^{1/2} m^{1/4} \log n \log^{3/2} m)$-approximation.

4.3 Quadratic Functions 

In this section we consider quadratic disclosure functions of the 
following form: 



$$
f_a(S_a)[p] = \left(\sum_{d \in D_p} a_{dp} x_{da}\right)^2 \tag{13}
$$

where $a_{dp}$ and $x_{da}$ are defined as before. Since $f_a(D_p)[p] = 1$ we have that $\left(\sum_{d \in D_p} a_{dp}\right)^2 = 1$. We assume that the utility is an additive function following the form of Equation 4 and do not consider the generic submodular case.

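The TRADEOFF optimization function can be rewritten as:

$$
\underset{S \in \mathcal{P}(D \times A)}{\text{maximize}} \quad \sum_{d \in D}\sum_{a \in A} w_{da} x_{da} + \lambda\left(\tau_I - \max_{a \in A} f_a(S_a)\right) \tag{14}
$$

where $f_a(\cdot)$ is a quadratic function. The internal maximization over information disclosure functions can be removed from the objective by rewriting the optimization problem as:

$$
\begin{aligned}
\underset{S \in \mathcal{P}(D \times A)}{\text{maximize}} \quad & \sum_{d \in D}\sum_{a \in A} w_{da} x_{da} + \lambda(\tau_I - y) \\
\text{subject to} \quad & y \ge f_a(S_a), \quad \forall a \in A, \\
& 1 \le \sum_{a \in A} x_{da} \le t, \quad \forall d \in D, \\
& x_{da} \in \{0, 1\}.
\end{aligned}
\tag{15}
$$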



Since all constraints are either linear or quadratic, the above prob- 
lem is an integer 0-1 Quadratic Constraint Problem (QCP) [?], 
which is, in general, NP-hard to solve. In order to derive an ap- 
proximate solution in PTIME, we relax this problem to an equiv- 
alent Second Order Conic Programming (SOCP) problem [?]. A 
SOCP problem can be solved in polynomial time to within any 
level of accuracy by using an interior-point method [?] resulting 
in a fractional solution $x_{da}$.

First, we focus on the constraints shown in Problem 15. The constraints for worst and average disclosure can be written as:

$$
\begin{aligned}
\left(\sum_{d \in D_p} a_{dp} x_{da}\right)^2 &= (A_{ap} x_a)^T A_{ap} x_a, \quad \forall p \in P,\ \forall a \in A \\
\frac{1}{|P|}\sum_{p \in P}\left(\sum_{d \in D_p} a_{dp} x_{da}\right)^2 &= (A_p x_a)^T A_p x_a, \quad \forall a \in A
\end{aligned}
\tag{16}
$$

where $x_a$ corresponds to a vector representation of the assignment set $S_a$ and $A_p = [a'_1, a'_2, \ldots, a'_{|D|}]$ is a positive vector that contains the appropriate weights for all data entries with respect to all properties $p \in P$.

Both constraints follow the quadratic form $(Ax)^T Ax$, where $x$ is a vector representation of the assignment set $S_a$ and $A$ is a matrix containing the appropriate $A_{ap}$'s or $A_p$'s based on the type of information disclosure we are using. Let $C$ denote the total number of constraints with respect to the disclosure function. Observe that we have $C = |P||A|$ and $C = |A|$ for the two cases of information disclosure respectively.

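The next step is to incorporate variable $y$ in the optimization problem. For that we extend vector $x$ to include variable $y$. The new variable vector is $[y \;\; x]^T$. We can rewrite the quantities in Equation 16 as follows:

$$
(Ax)^T (Ax) = \left([\,0 \;\; A\,]\begin{bmatrix} y \\ x \end{bmatrix}\right)^T \left([\,0 \;\; A\,]\begin{bmatrix} y \\ x \end{bmatrix}\right), \quad \forall c \in [C] \tag{17}
$$

where $A$ and $x$ are as defined above. Finally, we have that the equivalent SOCP problem to the initial QCP problem is:

$$
\begin{aligned}
\underset{q}{\text{maximize}} \quad & [\,-\lambda \;\; W\,]\, q + \lambda \tau_I \\
\text{subject to} \quad & [\,1 \;\; 0\,]\, q \ge (A' q)^T (A' q), \quad \forall c \in [C], \\
& 1 \le \sum_{a \in A} x_{da} \le t, \quad \forall d \in D,
\end{aligned}
\tag{18}
$$

where $q = [y \;\; x]^T$, $A' = [\,0 \;\; A\,]$, and $W$ is the vector of utility weights $w_{da}$.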
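For readers less familiar with conic programming, the following short derivation (a standard identity, not taken from the paper) shows why a quadratic constraint of the form $y \ge (A'q)^T(A'q)$ is representable as a second-order cone constraint. Here $v = A'q$ is shorthand introduced only for this derivation.

$$
\begin{aligned}
\left\| \begin{bmatrix} 2v \\ y - 1 \end{bmatrix} \right\|_2 \le y + 1
&\iff 4\|v\|_2^2 + (y - 1)^2 \le (y + 1)^2 \quad (\text{using } y + 1 \ge 0) \\
&\iff 4\|v\|_2^2 \le (y + 1)^2 - (y - 1)^2 = 4y \\
&\iff v^T v \le y .
\end{aligned}
$$

Since disclosure values are nonnegative, $y \ge 0$ and the side condition $y + 1 \ge 0$ holds automatically, so each of the $C$ quadratic constraints can be handed to an interior-point SOCP solver in this cone form.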

Finally, the fractional solutions $x_{da}$ obtained from the SOCP need to be converted to an integral solution. We point out that for the TRADEOFF problem no guarantees can be derived on the value of the objective function of the integral solution when naive randomized rounding schemes, such as setting $\hat{x}_{da} = 1$ with $\Pr[\hat{x}_{da} = 1] = x_{da}$, are used. Thus, finding a rounding scheme that ensures that the objective of the integral solution is equal to that of the fractional solution in expectation is an open problem.

5. A GREEDY LOCAL SEARCH 
HEURISTIC 

So far we studied specific families of disclosure functions to derive worst-case guarantees for the quality of approximate solutions. In this section we present two greedy heuristic algorithms suitable for any general disclosure function. We still require the utility function to be submodular. Our heuristics are based on hill climbing and the Greedy Randomized Adaptive Search Procedure (GRASP) [6]. Notice that local search heuristics are known to perform well when maximizing a submodular objective function [7]. Again, we only focus on the TRADEOFF optimization problem (see Equation 6).

Algorithm 1 Overall Algorithm
 1: Input: A: set of adversaries; G: objective function;
           r: number of repetitions; t: max. adversaries per data item
 2: Output: M_opt: a data-to-adversary assignment matrix
 3: for all i = 1 → r do
 4:     M_0 ← empty assignment; g_opt ← G(M_0)
 5:     (M_ini, g_ini) ← CONSTRUCTION(G, A, t);
 6:     (M, g) ← LOCALSEARCH(G, A, t, M_ini, g_ini);
 7:     if g > g_opt then
 8:         M_opt ← M; g_opt ← g;
 9: return M_opt;



5.1 Overall Algorithm 

Our algorithm proceeds in two phases (Algorithm 1). The first phase, which we call construction, constructs a data-to-adversary assignment matrix $M_{ini}$ by greedily picking assignments that maximize the specified objective function $G(\cdot)$, i.e., the tradeoff between utility and information disclosure, while ensuring that each data item is assigned to at least one and at most $t$ adversaries. The second phase, called local-search, searches for a better solution in the neighborhood of $M_{ini}$ by changing one assignment of one data item at a time if it improves the objective function, resulting in an assignment $M$. The construction algorithm may be randomized; hence, the overall algorithm is executed $r$ times, and the best solution $M_{opt} = \arg\max_{M \in \{M_1, \ldots, M_r\}} G(M)$ is returned as the final solution.

Algorithm 2 CONSTRUCTION
 1: Input: G: objective function; A: set of adversaries;
           t: max. adversaries per data item
 2: Output: (M, g): data-to-adversary assignment, objective value
 3: maxIterations ← t · |D|
 4: Initialize: M ← empty assignment
 5: for all i ∈ [1, maxIterations] do
 6:     // Compute a set of candidate assignments
 7:     D_M ← data entries assigned to < t adversaries in M;
 8:     Let S ← D_M × A − M
 9:     // Pick the next best assignment that improves the objective
10:     (d, a) ← PickNextBest(M, g, S, G)
11:     if (d, a) is NULL then
12:         break; // No new assignments improve the objective
13:     Assign the selected data entry d to the selected adversary a;
14: return (M, G(M));



5.2 Construction Phase 

The construction phase (Algorithm 2) starts with an empty data-to-adversary assignment matrix and greedily adds a new $(d, a)$ assignment to the mapping $M$ if it improves the objective function. This is achieved by iteratively performing two steps. The algorithm first computes a set of candidate assignments $S$. For any data item $d$ (which does not already have $t$ assignments), and any adversary $a$, $(d, a)$ is a candidate assignment if it does not appear in $M$.



Algorithm 3 PickNextBest
 1: Input: G: objective function; M: current assignment;
           g: current value of objective; S: possible new assignments
 2: Output: new assignment (d*, a*), or NULL
 3: GREEDY:
 4:     (d*, a*) ← argmax_{(d,a) ∈ S} G(M ∪ {(d, a)})
 5: GRASP:
 6:     Pick the top-n assignments S_n having the highest value for
        g_(d,a) = G(M ∪ {(d, a)}) from S, and having g_(d,a) > g.
 7:     (d*, a*) is drawn uniformly at random from S_n
 8: if G(M ∪ {(d*, a*)}) > g then
 9:     return (d*, a*)
10: else
11:     return NULL

Second, the algorithm picks the next best assignment from the candidates (using PickNextBest, Algorithm 3). We consider two methods for picking the next best assignment: GREEDY and GRASP. The GREEDY strategy picks the $(d^*, a^*)$ that maximizes the objective $G(M \cup \{(d^*, a^*)\})$. On the other hand, GRASP identifies a set $S_n$ of top $n$ assignments that have the highest value for the objective $g_{(d,a)} = G(M \cup \{(d, a)\})$, such that $g_{(d,a)}$ is greater than the current value of the objective $g$. Note that $S_n$ can contain fewer than $n$ assignments. The GRASP strategy picks an assignment $(d^*, a^*)$ at random from $S_n$. Both strategies return NULL if $(d^*, a^*)$ does not improve the value of the objective function. The construction stops when no new assignment can improve the objective function.
Complexity: The run time complexity of the construction phase is $O(t \cdot |A| \cdot |D|^2)$. There are $O(t \cdot |D|)$ iterations, and each iteration may have a worst-case running time of $O(|D| \cdot |A|)$. PickNextBest permits a simple parallel implementation.
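The construction phase and both selection strategies can be sketched in Python as follows. This is a minimal illustration rather than the authors' MATLAB implementation: assignments are represented as frozensets of `(d, a)` pairs, and the toy objective `G` (additive utility, a single step-disclosure property, $\lambda = 1$, $\tau_I = 0$, plus a penalty for unassigned items) is a hypothetical example.

```python
import random


def pick_next_best(G, M, g, S, strategy="greedy", n=3, rng=None):
    """Return a candidate (d, a) from S that improves on g, or None."""
    scored = [(G(M | {c}), c) for c in S]
    improving = sorted((s for s in scored if s[0] > g),
                       key=lambda s: s[0], reverse=True)
    if not improving:
        return None
    if strategy == "greedy":
        return improving[0][1]           # best improving candidate
    rng = rng or random.Random(0)
    return rng.choice(improving[:n])[1]  # GRASP: random among the top n


def construction(G, D, A, t, strategy="greedy", n=3, rng=None):
    """At most t * |D| insertions, never more than t adversaries per item."""
    M = frozenset()
    for _ in range(t * len(D)):
        count = {d: 0 for d in D}
        for d, _a in M:
            count[d] += 1
        S = [(d, a) for d in D if count[d] < t for a in A if (d, a) not in M]
        best = pick_next_best(G, M, G(M), S, strategy, n, rng)
        if best is None:
            break
        M = M | {best}
    return M, G(M)


# Toy instance (hypothetical numbers): one sensitive property over both items.
D, A, t = ["d1", "d2"], ["a1", "a2"], 1
w = {("d1", "a1"): 0.4, ("d1", "a2"): 0.1,
     ("d2", "a1"): 0.3, ("d2", "a2"): 0.2}
D_p = {"d1", "d2"}


def G(M):
    """Utility minus step disclosure minus number of unassigned items."""
    u = sum(w[c] for c in M)
    leak = max(int(D_p <= {d for d, b in M if b == a}) for a in A)
    unassigned = sum(1 for d in D if all(dd != d for dd, _ in M))
    return u - leak - unassigned


M, g = construction(G, D, A, t)
print(sorted(M))  # [('d1', 'a1'), ('d2', 'a2')]
```

In this toy run the greedy strategy first gives `d1` to its highest-utility adversary and then routes `d2` to the other adversary, because placing both items with `a1` would fully disclose the property.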



Algorithm 4 LOCALSEARCH
 1: Input: G: objective function; A: set of adversaries;
           t: max. assignments per data item; M: current assignment;
           g: current objective value
 2: Output: (M_opt, g_opt): the new assignment, the corresponding
            objective value
 3: for all d ∈ D do
 4:     A_d ← the set of adversaries to whom data item d is assigned
              (according to current assignment M);
 5:     // Construct a set of neighboring assignments
 6:     N_d ← {M}.
 7:     if (|A_d| < t) then N_d ← N_d ∪ {M ∪ {(d, a')} | ∀a' ∉ A_d};
 8:     for each adversary a ∈ A_d do
 9:         N_d ← N_d ∪ {M − {(d, a)}}
10:         N_d ← N_d ∪ {M − {(d, a)} ∪ {(d, a')} | ∀a' ∉ A_d};
11:     // Pick the neighboring assignment with maximum objective
12:     M_opt ← argmax_{M' ∈ N_d} G(M')
13:     M ← M_opt
14: return (M_opt, G(M_opt));



5.3 Local-Search Phase 

The second phase employs local search (Algorithm 4) to improve the initial assignment $M_{ini}$ output by the construction phase. In this phase, the data items are considered exactly once in some (random) order. Given the current assignment $M$, for each data item $d$, a set of neighboring assignments $N_d$ (including $M$) is considered by (i) removing an assignment to an adversary $a$ in $M$, (ii) modifying the assignment from adversary $a$ to an adversary $a'$ (that $d$ was not already assigned to in $M$), and (iii) adding a new assignment (if $d$ is not already assigned to $t$ adversaries in $M$). Next, the neighboring assignment $M_{opt}$ in $N_d$ with the maximum value for the objective is picked. The next iteration considers the data item succeeding $d$ (in the ordering) with $M_{opt}$ as the current assignment. We found that making a second pass of the dataset in the local search phase does not improve the value of the objective function.
Complexity: The run time complexity of the local-search phase is $O(t \cdot |A| \cdot |D|)$.
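A single local-search pass, with the three neighborhood moves described above, might look like the following Python sketch. As before, the frozenset representation and the toy objective `G` are assumptions made for illustration only.

```python
def local_search(G, D, A, t, M):
    """One pass over the data items, moving to the best neighbor each time."""
    M = frozenset(M)
    for d in D:
        A_d = {a for (dd, a) in M if dd == d}
        others = [a for a in A if a not in A_d]
        neighbors = [M]                                   # keep M as an option
        if len(A_d) < t:                                  # (iii) add
            neighbors += [M | {(d, b)} for b in others]
        for a in A_d:
            removed = M - {(d, a)}
            neighbors.append(removed)                     # (i) remove
            neighbors += [removed | {(d, b)} for b in others]  # (ii) swap
        M = max(neighbors, key=G)
    return M, G(M)


# Toy instance (hypothetical numbers): one sensitive property over both items.
D, A, t = ["d1", "d2"], ["a1", "a2"], 1
w = {("d1", "a1"): 0.4, ("d1", "a2"): 0.1,
     ("d2", "a1"): 0.3, ("d2", "a2"): 0.2}
D_p = {"d1", "d2"}


def G(M):
    """Utility minus step disclosure minus number of unassigned items."""
    u = sum(w[c] for c in M)
    leak = max(int(D_p <= {d for d, b in M if b == a}) for a in A)
    unassigned = sum(1 for d in D if all(dd != d for dd, _ in M))
    return u - leak - unassigned


start = {("d1", "a1"), ("d2", "a1")}   # both items with a1: full disclosure
M, g = local_search(G, D, A, t, start)
print(sorted(M))  # [('d1', 'a2'), ('d2', 'a1')]
```

Starting from an assignment that fully discloses the property, the pass swaps `d1` to the second adversary, which removes the disclosure at the cost of some utility.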

5.4 Extensions 

The construction phase (Algorithm 2) has a run time that is quadratic in the size of $D$. This is because in each iteration, the PickNextBest subroutine computes a global maximum assignment across all data-adversary pairs. While this approach makes the algorithm more effective in avoiding local optima, it reduces its scalability due to its quadratic cost.

To improve scalability, one can adopt a local myopic approach during construction. Instead of considering all possible (data, adversary) pairs when constructing the list of candidate assignments (see Ln. 8 in Algorithm 2), one can consider a single data entry $d$ and populate the set of candidate assignments $S$ using only (data, adversary) pairs that contain $d$. More specifically, we fix a total ordering of the data entries $O$, and perform $t$ iterations of the following:

• Consider the next data item $d$ in $O$. Let $M$ be the current assignment.

• Construct $S$ as $(\{d\} \times A) - M$.

• Pick the next best assignment in $S$ using Algorithm 3 (GREEDY or GRASP) that improves the objective function.

• Update the current assignment $M$, and proceed with the next data entry in $O$.

Complexity: The run time complexity of the myopic-construction phase is $O(t \cdot |A| \cdot |D|)$.
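The myopic variant can be sketched as below. The sketch interprets the "$t$ iterations" as $t$ passes over the ordering $O$ (an assumption on our part, chosen because it matches the stated $O(t \cdot |A| \cdot |D|)$ cost), implements only the GREEDY selection, and reuses a hypothetical toy objective.

```python
def myopic_construction(G, order, A, t):
    """t passes over `order`; each visit adds at most one pair for that item."""
    M = frozenset()
    for _ in range(t):
        for d in order:
            S = [(d, a) for a in A if (d, a) not in M]  # candidates for d only
            g = G(M)
            best = max(S, key=lambda c: G(M | {c}), default=None)
            if best is not None and G(M | {best}) > g:
                M = M | {best}
    return M, G(M)


# Toy instance (hypothetical numbers): one sensitive property over both items.
D, A, t = ["d1", "d2"], ["a1", "a2"], 1
w = {("d1", "a1"): 0.4, ("d1", "a2"): 0.1,
     ("d2", "a1"): 0.3, ("d2", "a2"): 0.2}
D_p = {"d1", "d2"}


def G(M):
    """Utility minus step disclosure minus number of unassigned items."""
    u = sum(w[c] for c in M)
    leak = max(int(D_p <= {d for d, b in M if b == a}) for a in A)
    unassigned = sum(1 for d in D if all(dd != d for dd, _ in M))
    return u - leak - unassigned


M, g = myopic_construction(G, D, A, t)
print(sorted(M))  # [('d1', 'a1'), ('d2', 'a2')]
```

Each visit evaluates only $|A|$ candidates instead of up to $|D| \cdot |A|$, which is where the saving over the global construction comes from.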

5.5 Correctness 

While both the construction and local search phases carefully ensure that each data item is assigned to no more than $t$ adversaries, we still need to prove that each data item is assigned to at least one adversary. To ensure this lower bound on the cardinality, we use the following objective function $G$:

$$
G(\cdot) = u(\cdot) + \lambda(\tau_I - f(\cdot)) - C \tag{19}
$$

where $C$ is the number of data items that are not assigned to any adversary and $\lambda \in [0, 1]$. The above objective function adds a penalty term $C$ to the tradeoff between the utility and the information disclosure, i.e., $u(\cdot) + \lambda(\tau_I - f(\cdot))$. We can show that this penalty ensures that every data item is assigned to at least one adversary. We have the following theorem:

Theorem 3. Using $G(\cdot) = u(\cdot) + \lambda(\tau_I - f(\cdot)) - C$ with $\lambda \in [0, 1]$ as the objective function in Algorithm 1 returns a solution where all cardinality constraints are satisfied.

Proof. Both the construction and the local-search phases ensure that no data item is assigned to more than $t$ adversaries (Ln. 7 in Algorithms 2 and 4).

We need to show that every data item is assigned to at least one adversary, i.e., at the end of the algorithm $C$ will be equal to 0. We will focus on the global Algorithm 1. The proof for the local version of the algorithm is analogous.

The main intuition behind the proof is the following: 

• At the end of the construction phase, if C > 0, then some data 
item must be assigned to > t adversaries. 

• In the local-search phase, making a data item unassigned never 
results in a better objective function. 

At the start of the construction phase, $C = |D|$. We show that if at some iteration $C = i$, then in that iteration some unassigned data item is assigned to an adversary and $C$ reduces by 1. If $C = i > 0$, there are three possible paths the algorithm can follow after the iteration is over: (1) no new assignment is chosen, (2) the algorithm chooses to assign a data entry to an adversary so that the number of violated constraints remains the same (i.e., that data entry is already assigned to at least one adversary), and (3) a data-to-adversary assignment is chosen so that $C = i - 1$. We will show that given the objective function we are using, the first path will never be chosen.

We evaluate the value of the objective function for paths 1 and 3. For the third path we consider the worst-case scenario, where only disclosure is increased (by $\Delta f$) and utility remains the same. The value of the objective function for paths 1 and 3 is as follows:

$$
\begin{aligned}
G_1 &= u + \lambda(\tau_I - f) - C \\
G_3 &= u + \lambda(\tau_I - f - \Delta f) - (C - 1)
\end{aligned}
$$

Since $\Delta f < 1$ and $\lambda \le 1$, we have $G_3 - G_1 = 1 - \lambda \Delta f > 0$, i.e., $G_3 > G_1$; thus, path 1 will never be taken if $C > 0$.

Notice that the objective values for paths 2 and 3 cannot be directly compared. However, we will show that during the $t|D|$ iterations we perform, path 3 will be chosen $|D|$ times and, hence, $C = 0$ at the end of Algorithm 2. Let $n_1$, $n_2$ and $n_3$ denote the number of times path 1, path 2 and path 3 are chosen, respectively. By construction we have that $n_1 + n_2 + n_3 = t|D|$. After $t|D|$ iterations, if $C = 0$ we are done. If $C > 0$, then based on the above statement we will have $n_1 = 0$, and thus $n_2 + n_3 = t|D|$. Now we will show that $n_3$ will always be equal to $|D|$. Let $n_3 < |D|$; we have that $n_2 = t|D| - n_3 > (t - 1)|D|$. Recall that when path 2 is chosen, the algorithm assigns to an adversary a data entry that is already assigned to another adversary. Therefore, if $n_2 > (t - 1)|D|$, then some data item is assigned to more than $t$ adversaries, which does not happen. Therefore, $n_3 = |D|$.

To prove that $C = 0$ at the end of the local-search phase, it suffices to show that the algorithm will never choose to remove an assignment from a data item assigned to exactly one adversary. Consider the state of the local-search algorithm right before an iteration of its main for-loop (Ln. 3-13 of Algorithm 4). We assume that the algorithm considers a data entry $d$ with a single assignment and that $C = 0$. Moreover, let $u$ and $f$ be the utility and disclosure at that point. Let $u'$ and $f'$ denote the utility and disclosure if $d$ is assigned to no adversary. Consider the best-case scenario where no utility is lost, i.e., $u' = u$, and $f' = 0$. Note that now $C = 1$. The new objective value will be $u' + \lambda(\tau_I - f') - C = u + \lambda\tau_I - 1$. We have that $u + \lambda\tau_I - 1 < u + \lambda(\tau_I - f)$, since $f < 1$ always holds and $\lambda \le 1$. □



6. EXPERIMENTS 

In this section we present an empirical evaluation of SPARSI. 
The main questions we seek to address are: (1) how the two ver- 
sions of the privacy-aware partitioning problem - TRADEOFF and 
DiSCBUDGET- compare with each other, and how well they ex- 
ploit the presence of multiple adversaries with respect to disclosure 
and utility, (2) how the different algorithms perform in optimizing 
the overall utility and disclosure, and (3) how practical SPARSI is 
for distributing real-world datasets across multiple adversaries. 

We empirically study these questions using both real and syn- 
thetic datasets. After describing the data and the experimental method- 
ology, we present results that demonstrate the effectiveness of our 
framework on partitioning sensitive data. The evaluation is performed on an Intel(R) Core(TM) i5 2.3 GHz/64bit/8GB machine.
SPARSI is implemented in MATLAB and uses MOSEK, a com- 
mercial optimization toolbox. 

Real-World Dataset: For the first set of experiments we present how SPARSI can be applied to real-world domains. We considered a social networking scenario as discussed in Example 1. We used the Brightkite dataset published by Cho et al. [2]. This dataset was extracted from Brightkite, a former location-based social networking service provider where users shared their locations by checking in. Each check-in entry contains information about the id of the user, the timestamp and location of the check-in. The dataset contains the public check-in data of users and the friendship network of users. The original dataset contains 4.5 million check-ins, 58,228 users and 214,078 edges. We subsampled the dataset and extracted a dataset comprised of 365,907 check-ins. The corresponding friendship network contains 3,266 nodes and 2,935 edges. In Section 6.1 we discuss how we modeled the utility and information disclosure for this data.

Synthetic Data: For the second set of experiments we used synthetically generated data to better understand the properties of different disclosure functions and the performance of the proposed algorithms. There are two data-related components in our framework. The first is a hypergraph that describes the interaction between data entries and sensitive properties (see Section 2.1), and the second is a set of weights $w_{da}$ representing the utility received when data entry $d \in D$ is published to adversary $a \in A$. The synthetic data are generated as follows. First, we set the total number of data entries $|D| \in \{50, 100, 200, 300, 500\}$, the total number of sensitive properties $|P| \in \{5, 10, 50, 100\}$, and the total number of adversaries $|A| \in \{2, 3, 5, 7, 10\}$.

Next, we describe the scheme we used to generate the utility weights $w_{da}$. There are two particular properties that need to be satisfied by such a scheme. The first one is that assigning any entry to an adversary should induce some minimum utility, since it allows us to fulfil the task under consideration (see Example 2). The second one is that there are cases where certain data items should induce higher utilities when assigned to specific adversaries, e.g., some workers may have better accuracy than others in crowdsourcing, or advertisers may pay more for certain types of data.

To satisfy both properties, we first choose a minimum utility value $u_{min}$ from a uniform distribution $U(0, 0.1)$. Then, we iterate over all possible data-to-adversary assignments and set the corresponding weight $w_{da}$ to a value drawn from a uniform distribution $U(0.8, 1)$ with probability $p_u$, or to $u_{min}$ with probability $1 - p_u$. For our experiments we set the probability $p_u$ to 0.4. Finally, weights are scaled down by dividing them by the number of adversaries $|A|$.

Next, we describe how we generate a random hypergraph $H = (X, E)$, with $|X| = |D|$ and $|E| = |P|$, describing the interaction between data entries and sensitive properties. To create $H$ we simply generate an equivalent bipartite dependency graph $G$ (see Section [O}) and convert that to the equivalent dependency hypergraph. In particular, we iterate over the possible data to sensitive property pairs and insert the corresponding edge to $G$ with probability $p_f$. For our experiments we set $p_f$ to 0.3.
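The synthetic generation scheme described above is simple enough to sketch directly. The following Python version is an illustration under our own naming (`generate_weights`, `generate_hypergraph`) and uses the parameter values stated in the text as defaults; the hypergraph is stored as a mapping from each property to its set of data entries.

```python
import random


def generate_weights(D, A, p_u=0.4, rng=None):
    """Utility weights w_da: U(0.8, 1) w.p. p_u, else a shared u_min ~ U(0, 0.1)."""
    rng = rng or random.Random(0)
    u_min = rng.uniform(0.0, 0.1)
    w = {}
    for d in D:
        for a in A:
            value = rng.uniform(0.8, 1.0) if rng.random() < p_u else u_min
            w[(d, a)] = value / len(A)  # scale down by the number of adversaries
    return w


def generate_hypergraph(D, P, p_f=0.3, rng=None):
    """Each (data entry, property) edge is inserted independently w.p. p_f."""
    rng = rng or random.Random(0)
    return {p: {d for d in D if rng.random() < p_f} for p in P}


D = [f"d{i}" for i in range(50)]
P = [f"p{j}" for j in range(5)]
A = ["a1", "a2", "a3"]
w = generate_weights(D, A, rng=random.Random(1))
H = generate_hypergraph(D, P, rng=random.Random(2))
print(len(w), {p: len(H[p]) for p in P})
```

With this scheme a property may, by chance, end up with no associated data entries; whether such properties are kept or regenerated is not specified in the text.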

Algorithms: We evaluated the following algorithms: 

• RAND+: Each data entry is assigned to exactly $t$ adversaries. The probability of assigning a data entry to an adversary is proportional to the corresponding utility weight $w_{da}$. We run the random partitioning 100 times, and select the data-to-adversary assignment that maximizes the objective function.

• LP: We solve the LP relaxation of the optimization problems for step (Section 4.1) and linear (Section 4.2) disclosure functions. We generate an integral solution from the resulting fractional solution using naive randomized rounding (see Section 4.1). Note that the constraints are satisfied in expectation. Moreover, we perform a second pass over the derived integral solution to guarantee that the cardinality constraints are satisfied. If a data item is not assigned to any adversary, we assign it to the adversary with the highest weight, i.e., corresponding fractional solution. On the other hand, if a data item is assigned to more than $t$ adversaries, we remove those with the lowest weight. This is a naive, yet effective, rounding scheme because the fractional solutions we get are close to the integral ones. More sophisticated rounding techniques can be used [4]. We run the rounding 100 times and select the data-to-adversary assignment with the maximum value of the objective.

• ILP: We solve the exact ILP for step and linear disclosure functions.

• GREEDY: Algorithm 1 with GREEDY strategy for picking a candidate.

• GRASP: Algorithm 1 with GRASP strategy for picking the candidate assignments using n = ^ and r = 10.

• GREEDYL: Local myopic variant of Algorithm 1 (see Section 5.4) with GREEDY strategy for picking a candidate.

• GRASPL: Local myopic variant of Algorithm 1 (see Section 5.4) with GRASP strategy for picking candidates, using $n = \min(k, 3)$ and $r = 10$.
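The second pass used by the LP baseline to restore the cardinality constraints can be sketched as follows. This is a hedged reading of the description in the LP item above: `x_int` is the rounded 0/1 solution, `x_frac` the fractional LP solution used as the "weight", and the function name `repair` is our own.

```python
def repair(x_int, x_frac, t):
    """Enforce 1 <= (number of adversaries per data entry) <= t."""
    fixed = {}
    for d, row in x_int.items():
        chosen = [a for a, v in row.items() if v == 1]
        if not chosen:
            # Unassigned entry: give it to the adversary with the largest
            # fractional value.
            chosen = [max(x_frac[d], key=x_frac[d].get)]
        elif len(chosen) > t:
            # Over-assigned entry: keep the t adversaries with the largest
            # fractional values and drop the rest.
            chosen = sorted(chosen, key=x_frac[d].get, reverse=True)[:t]
        fixed[d] = {a: 1 if a in chosen else 0 for a in row}
    return fixed


# Hypothetical rounded and fractional solutions.
x_frac = {"d1": {"a1": 0.2, "a2": 0.7, "a3": 0.1},
          "d2": {"a1": 0.9, "a2": 0.3, "a3": 0.6}}
x_int = {"d1": {"a1": 0, "a2": 0, "a3": 0},
         "d2": {"a1": 1, "a2": 1, "a3": 1}}

print(repair(x_int, x_frac, t=2))
# {'d1': {'a1': 0, 'a2': 1, 'a3': 0}, 'd2': {'a1': 1, 'a2': 0, 'a3': 1}}
```

The repair only touches the cardinality constraints; it does not re-check the disclosure constraint, so the repaired solution still has to be scored with the objective.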

Evaluation. To evaluate the performance of the aforementioned algorithms we used the following metrics: (1) the total utility $u$ corresponding to the final assignment, (2) the information disclosure $f$ for the final assignment and (3) the tradeoff between utility and disclosure, given by $u + \lambda(\tau_I - f)$. We evaluated the different algorithms using different step and linear information disclosure functions for TRADEOFF.

For all experiments we set $\lambda = 1$ and assume an additive utility function of the form $u_a(S_a) = \frac{1}{\sum_{d \in D} \sum_{a' \in \text{top-}t(A)} w_{da'}} \sum_{d \in D} w_{da} x_{da}$, where $x_{da}$ is an indicator variable that takes value 1 when data entry $d$ is revealed to adversary $a$ and 0 otherwise, and $\text{top-}t(A)$ returns the top $t$ adversaries with respect to weights $w_{da}$. Observe that the normalization used corresponds to the maximum total utility a valid data-to-adversary assignment may have, when ignoring disclosure. Using this value ensures that the total utility and the quantity $\tau_I - f$ have the same scale $[0, 1]$. For convenience we fixed the upper information disclosure to $\tau_I = 0$. Finally, for RAND+ we perform 10 runs and report the average, while for LP we perform the aforementioned rounding procedure 10 times and report the average. The corresponding standard errors are shown as error bars in the plots below.

6.1 Real Data Experiments 

We start by presenting how SPARSI can be applied to real-world applications. In particular, we evaluate the performance of the proposed local-search meta-heuristic on generic information disclosure functions using Brightkite. As described in the beginning of the section, this dataset contains the check-in locations of users and their corresponding friendship network. As illustrated in Example 1, we wish to distribute the check-in information to advertisers, while minimizing the information we disclose about the structure of the network. We first show how SPARSI can be used in this scenario.

Utility Weights. We start by modeling the utility provided when advertisers receive a subset of data entries. As mentioned above, each check-in entry contains information about location. We assume a total number of $k$ advertisers, so that each adversary is particularly interested in check-in entries that occurred in a certain geographical area. Given an adversary $a \in A$, we draw $w_{da}$ from a uniform distribution $U(0.8, 1)$ for all entries $d \in D$ that satisfy the location criteria of the adversary, and set $w_{da} = 0.1$ otherwise. We simulate this process by performing a random partitioning of the location ids across adversaries. As mentioned above, we assume an additive utility function.

Sensitive Properties and Information Disclosure. The sensitive property in this particular setup is the structure of the social network. More precisely, we require that no information is leaked about the existence of any friendship link among users. It is easy to see that each friendship link is associated with a sensitive property. Now, we examine how check-in entries leak information about the presence of a friendship link. Cho et al. [2] showed that there is a strong correlation between the trajectory similarity and the existence of a friendship link for two users. Computing the trajectory similarity for a pair of users is easy and can be done by computing the cosine similarity of the users given the published set of check-ins. Because of this strong correlation we assume that the information leakage for a sensitive property, i.e., the link between a pair of users, is equal to the trajectory similarity.

More precisely, let $D_a \subseteq D$ be the check-in data published to adversary $a \in A$. Let $U$ denote the set of users referenced in $D_a$. Given a sensitive property $p = e(u_i, u_j)$, $u_i, u_j \in U$, $i \neq j$, we have that the information disclosure for $p$ is:

$$
f(D_a)[p] = \text{CosineSimilarity}(D_a(u_i), D_a(u_j)) \tag{20}
$$

where $D_a(u_i)$ and $D_a(u_j)$ denote the set of check-in data for users $u_i$ and $u_j$ respectively. We aggregate the given check-ins based on their unique pairs of users and locations, and we extract 15,661 data entries that contain the user information, the location and the number of times that user visited that particular location. Cosine similarity is computed over the visit counts in these new data entries.
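The disclosure function of Equation 20 can be sketched in Python as follows. The entry format `(user, location, visit_count)` mirrors the aggregated entries described above; treating a user with no published check-ins as having similarity 0 is an assumption of this sketch, as are the function names.

```python
import math
from collections import Counter


def cosine_similarity(counts_i, counts_j):
    """Cosine similarity between two location -> visit-count vectors."""
    dot = sum(c * counts_j.get(loc, 0) for loc, c in counts_i.items())
    norm_i = math.sqrt(sum(c * c for c in counts_i.values()))
    norm_j = math.sqrt(sum(c * c for c in counts_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # assumption: no published check-ins means no leakage
    return dot / (norm_i * norm_j)


def link_disclosure(published, u_i, u_j):
    """Disclosure about the link (u_i, u_j) from one adversary's entries.

    published: iterable of (user, location, visit_count) entries.
    """
    vec = {u_i: Counter(), u_j: Counter()}
    for user, loc, count in published:
        if user in vec:
            vec[user][loc] += count
    return cosine_similarity(vec[u_i], vec[u_j])


# Hypothetical entries: the same two users, published together vs. split up.
together = [("u1", "cafe", 2), ("u2", "cafe", 2)]
split = [("u1", "cafe", 2), ("u3", "gym", 1)]
print(link_disclosure(together, "u1", "u2"))  # 1.0
print(link_disclosure(split, "u1", "u2"))     # 0.0
```

The second call illustrates why partitioning helps here: if an adversary never sees check-ins of both users, the trajectory similarity it can compute for that pair is zero.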

Results. As mentioned above, we aim to minimize the information leaked about any edge in the network. We model this requirement by considering the average-case information leakage and setting $\tau_I = 0$. In particular, we would like that no information is leaked at all if possible. Moreover, to partition the dataset we solve the corresponding TRADEOFF problem. Since we consider cosine similarity we are limited to using RAND+ and one of the local-search heuristics. In particular, we compare the quality of the solutions for RAND+, GREEDYL and GRASPL. Because our implementation of GREEDY and GRASP is single-threaded, these algorithms do not scale for this particular task and thus are omitted. However, as illustrated later (see Section 6.2), the myopic algorithms are very efficient in minimizing the information leakage and are thus suitable for our current objective. Again, we run experiments for $|A| \in \{2, 3, 5, 7, 10\}$.

First, we point out that in all experiments GREEDYL and GRASPL
reported data-to-adversary assignments with zero disclosure. This
means that our algorithms were able to distribute the data entries
across adversaries so that no information at all is leaked about the
structure of the social network (i.e., friendship links) to any
adversary. On the other hand, the average information disclosure for
RAND+ ranges from 0.99 (almost full disclosure) to 0.1 as the number
of adversaries varies from 2 to 10 (see Table 1). Disclosing the
structure of the entire network with probability almost 1 violates
our initial specifications; hence, RAND+ fails to solve the problem
when the number of adversaries is small. As the number of adversaries
increases, the average amount of disclosed information decreases.

We continue our analysis and focus on the utility and the tradeoff
objective. The corresponding results are shown in Figure 2. Figures
2(a) and 2(b) correspond to the utility and the utility-disclosure
tradeoff respectively. The corresponding disclosure is shown in
Table 1. As shown, for a small number of adversaries both GREEDYL
and GRASPL generate solutions that induce low values for the total
utility. This is expected, since both algorithms give particular
emphasis to minimizing information disclosure due to the tradeoff
formulation of the underlying optimization problem. As the number of
adversaries increases, both algorithms can exploit the structure of
the problem better and offer solutions that induce utility values
comparable to or higher than the ones reported by RAND+.

However, looking only at utility values can be misleading, as a
very high utility value might also incur a high disclosure value. In
fact, RAND+ follows this behavior exactly. The high-utility
data-to-adversary assignments, when the number of adversaries is
small, are associated with almost full disclosure of the structure of
the entire social network. This is captured in Figure 2(b), where we
see that the local-search algorithms clearly outperform RAND+ since
no information is disclosed (see Table 1). As shown in this figure,
in most cases, the average objective value for RAND+ is significantly
lower than the ones reported by both GREEDYL and GRASPL. Recall
that we compute the average over multiple runs of RAND+, where
for each run we execute the algorithm multiple times and consider



Table 1: Average information disclosure reported by RAND+,
GREEDYL and GRASPL for Brightkite. Notice that local-search
algorithms generate solutions that reveal no information
about the structure of the friendship network. Standard errors
are reported in parentheses.

crease is due to randomization. This is more obvious if we contrast
the performance of GRASPL with GRASP. We can see that randomization
leads to worse solutions (with respect to utility) when keeping a
myopic view of the given optimization problem. On the other hand,
randomization is helpful in the case of non-myopic local search.
Recall that the reported numbers correspond to no information
disclosure, while missing values correspond to full disclosure of at
least one sensitive property. As depicted, GREEDY failed to find a
valid solution when splitting the data across 2 adversaries. However,
when randomization was used, GRASP was able to find the optimal
solution. We observed similar performance for different values of
$|D|$; these results are omitted due to space constraints.

Next, we ran a second set of experiments to investigate the
performance of the proposed heuristics with respect to the amount
of disclosed information. Recall that none of the local-search
algorithms explicitly checks for infeasible values of disclosure for
step functions, since they optimize the tradeoff between utility and
disclosure. Therefore, they might report infeasible solutions. We
fixed the number of sensitive properties to $|P| = 50$ and the number
of adversaries to $k = 2$, and we varied the number of data items
$|D|$. We considered $|D| \in \{100, 200, 300, 500\}$. We observed the
same behavior as in the previous experiment, i.e., all local-search
heuristics failed to report feasible solutions.



Figure 2: Tradeoff objective and utility for Brightkite. RAND+
generates solutions with high utility but almost full disclosure
(see Table 1). This leads to a poor tradeoff value.
GREEDYL and GRASPL disclose no information, and outperform
RAND+ with respect to the overall optimization objective.



the best solution reported. The large error bars are indicative of the
non-robustness of RAND+ for this problem.

6.2 Synthetic Data Experiments 

In this section, we present our results based on the synthetic data. 
We examined the behavior of the proposed algorithms under sev- 
eral scenarios, where we varied the properties of the dataset to be 
partitioned, the number of adversaries and the family of disclosure 
functions considered. 

Step Functions. We started by considering step functions. Under
this family of disclosure functions, both DISCBUDGET and TRADEOFF
correspond to the same optimization problem (see Section 4.1).
Moreover, assuming feasibility of the optimization problem,
information disclosure will always be zero. In such cases,
considering the total utility alone is sufficient to compare the
performance of the different algorithms. However, when information is
disclosed, comparing the number of fully disclosed properties allows
us to evaluate the performance of the different algorithms.

First, we fixed the number of data entries in the dataset to be
$|D| = 500$ and considered values of $|P|$ in $\{5, 10, 50, 100\}$.
Figure 3 shows the utility derived by the data-to-adversary
assignments corresponding to the different algorithms for $|P| = 50$.
As depicted, all algorithms that exploit the structure of the
dependency graph while solving the underlying optimization problem
(i.e., LP, GREEDY, GREEDYL, GRASP and GRASPL) outperform RAND+. In
most cases, LP, GREEDYL, GREEDY and GRASP were able to find the
optimal solution that ILP reported. The high performance of the LP
algorithm is explained by the fact that the fractional solution
reported was in fact an integral solution.

GRASPL found solutions with non-optimal utilities, which are
still better than RAND+. We conjecture that this performance de-

Figure 3: Utility for step disclosure functions. All reported
numbers correspond to no information disclosure, while missing
values correspond to full disclosure.

The corresponding results are presented in Table 2. As the number
of data entries increased for a fixed number of sensitive
properties, the number of fully disclosed properties decreased.
This behavior is expected: as the number of data entries per
property increases, it becomes easier for our heuristics to find a
partitioning in which the data items for a single property are
split across adversaries, inducing zero disclosure.
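
To make the notion of a fully disclosed property concrete, here is a minimal sketch of how the counts in such an experiment could be computed. It assumes a step disclosure function under which a property is fully disclosed exactly when some single adversary receives every data entry linked to it; this reading, the function name and the dictionary-based inputs are assumptions for illustration, not the paper's code.

```python
def fully_disclosed_properties(properties, assignment):
    """Return the properties fully disclosed to at least one adversary.

    properties: dict mapping a property id to the set of data-entry ids
                it depends on (one hyperedge of the dependency graph).
    assignment: dict mapping a data-entry id to the set of adversaries
                that receive it (entries missing here are unpublished).

    Assumed step rule: a property is disclosed iff one adversary holds
    all of its data entries.
    """
    disclosed = set()
    for prop, entries in properties.items():
        if not entries:
            continue  # a property with no linked entries cannot be inferred
        receivers = [set(assignment.get(d, ())) for d in entries]
        # Adversaries that hold every entry linked to this property.
        if set.intersection(*receivers):
            disclosed.add(prop)
    return disclosed
```

With more data entries per property, there are more ways to place at least one entry of each property with a different adversary, which is consistent with the decreasing counts discussed above.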



Table 2: Fully disclosed properties for data-to-adversary
assignments for step disclosure functions. As the number of data
entries per property increases, local-search heuristics exploit
the structure of the problem better and report solutions with
fewer fully disclosed properties.

The experiments above show that GREEDY and GRASP are viable
alternatives for the case of step functions. However, for harder
instances of the problem, where the number of sensitive properties
is large and the number of adversaries is small, solving the LP
relaxation offers a more robust and reliable alternative.

Linear Functions. We continue our discussion and present our
experimental results for linear functions. First, we compared the
quality of the solutions produced when solving DISCBUDGET and
TRADEOFF optimally for both worst and average disclosure. We
generated a synthetic instance of the problem by setting $|D| = 50$
and $|P| = 10$, and we ran ILP for $|A| \in \{2, 3, 5, 7, 10\}$. We set
the maximum allowed disclosure to $\tau_I = 0.9$ and $t = 2$.

The utility and corresponding disclosure are shown in Figure 4
when DISCBUDGET and TRADEOFF are solved optimally. As shown, the
worst-case disclosure remains at the same level across different
numbers of adversaries. Now, consider the case of average
disclosure (see Equation 2). We see that for both versions of the
optimization problem the disclosure decreases as the number of
adversaries increases. However, the optimization corresponding to
TRADEOFF is able to exploit the presence of multiple adversaries
better, reducing disclosure while maintaining high utility.

Next, we evaluated RAND+, LP, GREEDYL, GRASPL, GREEDY and
GRASP on solving TRADEOFF. We set the upper disclosure $\tau_I$ to
reflect our requirement to minimize disclosure as much as possible.
We do not report any results for the ILP since, for $|D| > 50$, it
was not able to terminate in reasonable time. First, we fixed the
number of properties to $|P| = 50$ and considered instances with
$|D| \in \{100, 200, 300, 500\}$. We considered average disclosure.
We show the performance of the algorithms for $|D| = 500$ in
Figure 5. As shown, LP, GREEDY and GRASP outperform RAND+ both in
terms of the overall utility and the average disclosure. In fact,
RAND+ performs poorly, as it returns solutions with higher
disclosure but lower utility than the LP. Furthermore, we see that
the performance gap between the local-search algorithms and RAND+
keeps increasing as the number of adversaries increases. This is
expected, as the proposed heuristics take into account the structure
of the underlying dependency graph and, hence, can exploit the
presence of multiple adversaries to achieve a higher overall utility
and lower disclosure.

Furthermore, we see that solutions derived using the LP
approximation provide the largest utility, while solutions derived
using the proposed local-search algorithms minimize the disclosure.
As presented in Figure 5, using the myopic construction returns
solutions with low overall utility. This behavior is expected since
the algorithm does not maintain a global view of the
data-to-adversary assignment. Observe that randomization improves the
quality of the solution with respect to total utility when the
global-view construction is used (see GREEDY and GRASP). When the
myopic construction is used, randomization gives lower quality
solutions.

Figure 4: The (a) utility and (b) disclosure when solving
DISCBUDGET and TRADEOFF optimally. TRADEOFF can exploit
the presence of multiple adversaries better, to reduce disclosure
while maintaining high utility.

Measuring the average disclosure is an indicator of the overall
performance of the proposed algorithms. However, it does not
provide us with detailed feedback about the information disclosure
across properties for the different algorithms. To understand the
exact information disclosure for the solutions returned by the
different algorithms, we measured the total number of properties
that exceeded a particular disclosure level. We present the
corresponding plots for $|D| = 500$, $|P| = 50$, $k = 2$ and $k = 7$
in Figure 6. As shown, the proposed search algorithms can exploit
the presence of multiple adversaries very effectively in order
to minimize disclosure. If we compare Figures 6(a) and 6(b), we see
that the total number of properties reaches zero for a significantly
smaller disclosure threshold in the presence of ten adversaries.

Figure 6: The number of properties that exceed a particular
disclosure level for $|D| = 500$ and $|P| = 50$. GREEDYL and
GRASPL can exploit the presence of multiple adversaries more
effectively to minimize disclosure. The total number of properties
reaches zero for a significantly smaller disclosure threshold
in the presence of ten adversaries.
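
The per-threshold counts plotted in this kind of figure can be derived from the per-property disclosure values of a solution. The sketch below shows one way to do so; the function name, the input dictionary and the use of a strict comparison are assumptions for illustration.

```python
def count_exceeding(disclosure_by_property, thresholds):
    """For each threshold, count the properties whose disclosure exceeds it.

    disclosure_by_property: dict mapping a property id to its disclosure
                            value in [0, 1] under a given assignment.
    thresholds: iterable of disclosure levels to evaluate.
    """
    values = list(disclosure_by_property.values())
    # Strict inequality: a property at exactly the threshold is not counted.
    return [sum(1 for v in values if v > t) for t in thresholds]
```

For example, with disclosures `{"p1": 0.9, "p2": 0.4, "p3": 0.1}` and thresholds `[0.0, 0.25, 0.5, 1.0]`, the counts are `[3, 2, 1, 0]`. A curve that reaches zero at a smaller threshold indicates that no property is disclosed beyond that level.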



7. RELATED WORK 

There has been much work on the problem of publishing (or allowing
aggregate queries over) sensitive datasets (see surveys [1, 5]).
Here, information disclosure is characterized by a privacy
definition, which consists of either syntactic constraints on the
output dataset (e.g., $k$-anonymity [20] or $\ell$-diversity [17]) or
constraints on the publishing or query answering algorithm (e.g.,
$\epsilon$-differential privacy [5]). Each privacy definition is
associated with a privacy level ($k$, $\ell$, $\epsilon$, etc.) that
represents a bound on the information disclosure. Typical algorithmic
techniques for data publishing or query answering, which include
generalization or coarsening of values, suppression, output
perturbation, and sampling, attempt to maximize the utility of the
published data given some level of privacy (i.e., a bound on the
disclosure). Krause et al. [14] consider the problem of trading off
utility for disclosure, and consider general submodular utility and
supermodular disclosure functions. That paper formulates a submodular
optimization problem and presents an efficient algorithm for it.
However, all the above techniques assume that all the data is
published to a single adversary. Even when multiple parties may ask
different queries, prior work makes a worst-case assumption that they
arbitrarily collude. In this paper, on the other hand, we formulate
the novel problem of sharing data with multiple non-colluding
adversaries, and develop near-optimal algorithms for trading off
utility for information disclosure in this setting.

Figure 5: Tradeoff objective, utility and disclosure for linear
functions considering average disclosure ($|D| = 500$, $|P| = 50$). LP,
GREEDY and GRASP outperform RAND+ both in terms of total utility and
average disclosure. LP maximizes utility, while the local-search
heuristics are the most effective in minimizing disclosure.

8. CONCLUSIONS AND FUTURE WORK 

More and more sensitive information is released on the Web
and processed by online services, naturally raising privacy
concerns in domains where detailed and fine-grained information
must be published. In this paper, motivated by applications
like online advertising and crowd-sourcing markets, we introduce
the problem of privacy-aware $k$-way data partitioning, namely, the
problem of splitting a sensitive dataset among $k$ untrusted parties.
We present SPARSI, a theoretical framework that allows us to
formally define the problem as an optimization of the tradeoff between
the utility derived by publishing the data and the maximum
information disclosure incurred to any single adversary. Moreover, we
prove that solving it is NP-hard by a reduction from hypergraph
partitioning. We present a performance analysis of different
approximation algorithms for a variety of synthetic and real-world
datasets, and demonstrate how SPARSI can be applied in the domain of
online advertising. Our algorithms are able to partition user-location
data across multiple advertisers while ensuring that almost no
sensitive information about potential friendship links between these
users can be inferred by any advertiser.

Our research so far has raised several interesting research
directions. To our knowledge, this is the first work that leverages
the presence of multiple adversaries to minimize the disclosure of
private information while maximizing utility. While we provided
worst-case guarantees for several families of disclosure functions,
an interesting future direction is to examine whether rigorous
guarantees can be provided for other widely used information
disclosure functions, like information gain, or whether the current
ones can be improved. Finally, it is of particular interest to
consider how the proposed framework can be extended to interactive
scenarios, where data are published to adversaries more than once, or
to streaming data, where the partitioning must be done in an online
manner.